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Multivariate Analysis is an increasingly common tool in experimental high energy physics; however, many of the 
common approaches were borrowed from other fields. We clarify what the goal of a multivariate algorithm should 
be for the search for a new particle and compare different approaches. We also translate the Neyman-Pcarson 
theory into the language of statistical learning theory. 



1. Introduction 

Multivariate Analysis is an increasingly common 
tool in experimental high energy physics; however, 
most of the common approaches were borrowed from 
other fields. Each of these algorithms were developed 
for their own particular task, thus they look quite dif- 
ferent at their core. It is not obvious that what these 
different algorithms do internally is optimal for the the 
tasks which they perform within high energy physics. 
It is also quite difficult to compare these different al- 
gorithms due to the differences in the formalisms that 
were used to derive and/or document them. In Sec- 
tion El we introduce a formalism for a Learning Ma- 
chine, which is general enough to encompass all of 
the techniques used within high energy physics. In 
Sections |3] & 0] we review the statistical statements 
relevant to new particle searches and translate them 
into the formalism of statistical learning theory. In 
the remainder of the note, we look at the main re- 
sults of statistical learning theory and their relevance 
to some of the common algorithms used within high 
energy physics. 



2. Formalism 

Formally a Learning Machine is a family of func- 
tions T with domain / and range O parametrized by 
a e A. The domain can usually be thought of as, or 
at least embedded in, M. d and we generically denote 
points in the domain as x. The points x can be re- 
ferred to in many ways {e.g. patterns, events, inputs, 
examples, . . . ). The range is most commonly R, [0, 1], 
or just {0, 1}. Elements of the range are denoted by 
y and can be referred to in many ways {e.g. classes, 
target values, outputs, . . . ). The parameters a spec- 
ify a particular function f a £ T and the structure of 
a G A depends upon the learning machine Q, • 

In the modern theory of machine learning, the per- 
formance of a learning machine is usually cast in the 
more pessimistic setting of risk. In general, the risk, 
R, of a learning machine is written as 



where Q measures some notion of loss between f a {x) 
and the target value y. For example, when classifying 
events, the risk of mis-classification is given by Eq. ^ 
with Q{x, y; a) = \y — f a {x)\. Similarly, for regression 1 
tasks one takes Q{x,y;a) = (y — f a {x)) 2 . Most of 
the classic applications of learning machines can be 
cast into this formalism; however, searches for new 
particles place some strain on the notion of risk. 



3. Searches for New Particles 

The conclusion of an experimental search for a new 
particle is a statistical statement - usually a decla- 
ration of discovery or a limit on the mass of the hy- 
pothetical particle. Thus, the appropriate notion of 
performance for a multivariate algorithm used in a 
search for a new particle is that performance mea- 
sure which will maximize the chance of declaring a 
discovery or provide the tightest limits on the hypo- 
thetical particle. In principle, it should be a fairly 
straight-forward procedure to use the formal statis- 
tical statements to derive the most appropriate per- 
formance measure. This procedure is complicated by 
the fact that experimentalists (and statisticians) can- 
not settle on a formalism to use {i.e. Bayesians vs. 
Frequentists) . As an example, let us consider the Fre- 
quentist theory developed by Neyman and Pearson . 
This was the basis for the results of the search for the 
Standard Model Higgs boson at LEP 3 . 

The Neyman-Pearson theory (which we review 
briefly for completeness) begins with two Hypothe- 
ses: the null hypothesis Ho and the alternate hypoth- 
esis Hi In the case of a new particle search Hq 
is identified with the currently accepted theory {i.e. 
the Standard Model) and is usually referred to as the 
"background-only" hypothesis. Similarly, H\ is iden- 
tified with the theory being tested usually referred to 
as the "signal-plus-background" hypothesis 



R{a) 



Q(x,y;a) p{x,y)dxdy 



(1) 



1 During the presentation, J. Friedman did not distinguish 
between these two tasks; however, in a region with p(x, 1) = b 
and p(x, 0) = 1 — b, the optimal f(x) for classification and 
regression differ. For classification, f(x) = {1 if b > 1/2, else 0}, 
and for regression the optimal f(x) = b. 
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Next, one defines a region W £ I such that if the 
data fall in W we accept the null hypothesis (and re- 
ject the alternate hypothesis) 2 . Similarly, if the data 
fall in / — W we reject the null hypothesis and accept 
the alternate hypothesis. The probability to commit 
a Type I error is called the size of the test and is given 
by (note alternate use of a) 



p(x\Ho)dx. 



i-w 



The probability to commit a Type II error is given by 



0- 



w 



p(x\H\)dx. 



(3) 



Finally, the Neyman-Pearson lemma tells us that the 
region W of size a which minimizes the rate of Type II 
error (maximizes the power) is given by 



W 



K4gi) 

p(x\H ) 



> k a 



(4) 



4. The Neyman-Pearson Theory in the 
Context of Risk 

In Section we provided the loss functional Q ap- 
propriate for the classification and regression tasks; 
however, we did not provide a loss functional for 
searches for new particles. Having chosen the 
Neyman-Pearson theory as an explicit example, it is 
possible to develop a formal notion of risk. 

Once the size of the test, a, has been agreed upon, 
the notion of risk is the probability of Type II er- 
ror p. In order to return to the formalism out- 
lined in Section [2J identify Hi with y = 1 and 
H with y = 0. Let us consider learning ma- 
chines that have a range R which we will compose 
with a step function f(x) = Q(f a (x) — k a ) so that 
by adjusting k a we insure that the acceptance re- 
gion W has the appropriate size. The region W 
is the acceptance region for Hq, thus it corresponds 
to W = {x\f(x) = 0} and I - W = {x\f(x) = 
1}. We can also translate the quantities p(x\H n ) 
and p(x\Hi) into their learning-theory equivalents 
p(x\0) = p(x,0)/p(0) = S(y)p(x,y)/ jp(x,0)dx and 
6(1 — y)p{x,y)/ J p(x,l)dx, respectively. With these 
substitutions we can rewrite the Neyman-Pearson the- 
ory as follows. A fixed size gives us the global con- 
straint 



J 6(/ a (x) - k a ) S(y) p(x,y))dxdy 
J p(x, 0)dx 



(5) 



2 With m measurements, we should actually consider the 
data as (xi,...,x m ) £ I" 1 , but, for ease of notation, let us 
only consider m = 1. 



and the risk is given by 

/[l-e(/ a (a;)-A a )] p(x,l)dx 



P = 



J p(x, l)dx 
oc / Q(f a (x) + k a ) 5(1 - y) p(x,y)dxdy. 



(G) 



Extracting the integrand we can write the loss func- 
tional as 

Q(x,y;a)=e(f a (x) + k a )S(l-y). (7) 

Unfortunately, Eq.^does not allow for the global con- 
straint imposed by k a (which is implicitly a functional 
of f a ), but this could be accommodated by the meth- 
ods of Euler and Lagrange. Furthermore, the con- 
straint cannot be evaluated without explicit knowl- 
edge of p(x,y). 



4.1. Asymptotic Equivalence 

Certain approaches to multivariate analysis lever- 
age the many powerful theorems of statistics assum- 
ing one can explicitly refer to p(x,y). This depen- 
dence places a great deal of stress on the asymptotic 
ability to estimate p(x, y) from a finite set of samples 
{(x,y)i}. There are many such techniques for esti- 
mating a multivariate density function p(x, y) given 
the samples 0, 0. Unfortunately, for high dimen- 
sional domains, the number of samples needed to en- 
joy the asymptotic properties grows very rapidly; this 
is known as the curse of dimensionality. 

In the case that there is no (or negligible) interfer- 
ence between the signal process and the background 
processes one can avoid the complications imposed 
by quantum mechanics and simply add probabili- 
ties. This is often the case with searches for new 
particles, thus the signal-plus-background hypothe- 
sis can be rewritten p(x, |£/i) = n s p s (x) + ribPb(x), 
where n s and Uf, are normalization constants that 
sum to unity. This allows us to rewrite the contours 
of the likelihood ratio as contours of the signal-to- 
background ratio. In particular the contours of the 
likelihood ratio p(x\H\) / p(x\H$) = k a can be rewrit- 
ten as p s (x)/pb(x) = (k a - n b )/n s = k' a . 

The kernel estimation techniques described in this 
conference represent a particular statistical approach 
in which classification is achieved by cutting on a dis- 
criminant function D(x) Q- The discriminant func- 
tion D(x) = Ps(x)/ (p s (x) + Pb(x)) is one-to-one with 
p s (x)/pb(x) (which is in turn one-to-one with the like- 
lihood ratio). These correspondences are only valid 
asymptotically, and the ability to accurately approxi- 
mate p(x) from an empirical sample is often far from 
ideal. However, for particle physics applications, up to 
5-dimensional multivariate analyses have shown good 
performance Q. Furthermore, they have the added 
benefit that they can be easily understood 
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4.2. Direct vs. Indirect Methods 

The loss functional defined in Eq. 0is derived from 
a minimization on the rate of Type II error. This 
is logically distinct from, but asymptotically equiv- 
alent to, approximating the likelihood ratio. In the 
case of no interference, this is logically distinct from, 
but asymptotically equivalent to, approximating the 
signal-to-background ratio. In fact, most multivari- 
ate algorithms are concerned with approximating an 
auxiliary function that is one-to-one with the likeli- 
hood ratio. Because the methods are not directly con- 
cerned with minimizing the rate of Type II error, they 
should be considered indirect methods. Furthermore, 
the asymptotic equivalence breaks down in most ap- 
plications, and the indirect methods are no longer op- 
timal. Neural networks, kernel estimation techniques, 
and support vector machines all represent indirect so- 
lutions to the search for new particles. The Genetic 
Programming (GP) approach presented in Section [5] 
is a direct method concerned with optimizing a user- 
defined performance measure. 



5. Statistical Learning Theory 

The starting point for statistical learning theory is 
to accept that we might not know p{x,y) in any an- 
alytic or numerical form. This is, indeed, the case 
for particle physics, because only {{x,y)i} can be ob- 
tained from the Monte Carlo convolution of a well- 
known theoretical prediction and complex numerical 
description of the detector. In this case, the learn- 
ing problem is based entirely on the training samples 
{{x,y)i} with I elements. The risk functional is thus 
replaced by the empirical risk functional 




1 ' 

Rem P (a) = -^Q{x l ,y i ;a). 



(8) 



There is a surprising result that the true risk 
R(a) can be bounded independent of the distribution 
p(x,y). In particular, for < Q(x,y;a) < 1 



R{a)<R 



cmp 




h{\og{2l/h) + 1) - logfo/4) 
I 



(9) 

where h is the Vapnik-Chervonenkis (VC) dimension 
and r\ is the probability that the bound is violated. As 
7/ — > 0, h — > oo, or I — + the bound becomes trivial. 

The VC dimension is a fundamental property of a 
learning machine J 7 , and is defined as the maximal 
cardinality of a set which can be shattered by J-. "A 
set {xi} can be shattered by JF" means that for each of 
the 2 h binary classifications of the points {x^, there 
exists a f a e T which satisfies yi — f a (xi). A set 
of three points can be shattered by an oriented line 



Figure 1: Example of an oriented line shattering 3 
points. Solid and empty dots represent the two classes 
for y and each of the 2 3 permutations are shown. 



as illustrated in Figure ^ Note that for a learning 
machine with VC dimension h, not every set of h ele- 
ments must be shattered by but at least one. 

Eq.[H]is a remarkable result which relates the num- 
ber of training examples I, a fundamental property 
of the learning machine h, and the risk R indepen- 
dent of the unknown distribution p(x, y). The bounds 
provided by Eq. [5] are relatively weak due to their 
stunning generality. 

It is important to realize that with an independent 
testing sample one can evaluate the true risk arbi- 
trarily well. This testing sample, by definition, is not 
known to the algorithm, so the bound is useful for 
the design of algorithms through structural risk min- 
imization. However, neural networks and most other 
methods rely on an independent testing sample to aid 
in their design and validation. An independent testing 
sample is clearly a better way to assess the true risk 
of a multivariate algorithm; however, Eq. Eldoes shed 
light on the issues of overtraining, suggests the num- 
ber of training samples that are needed, and offers a 
tool to compare different algorithms. 



5.1. VC Dimension of Neural Networks 

In order to apply Eq.[5| one must determine the VC 
dimension of neural networks. This is a difficult prob- 
lem in combinatorics and geometry aided by algebraic 
techniques. Eduardo Sontag has an excellent review of 
these techniques and shows that the VC dimension of 
neural networks can, thus far, only be bounded fairly 
weakly [9j- In particular, if we define p as the number 
of weights and biases in the network, then the best 
bounds are p 2 < h < p 4 . In a typical particle physics 
neural network one can expect 100 < p < 1000, which 
translates into a VC dimension as high as 10 12 , which 
implies I > 10 13 for reasonable bounds on the risk. 
These bounds imply enormous numbers of training 
samples when compared to a typical training sample 
of 10 5 . Sontag goes on to show that these shattered 
sets are incredibly special and that the set of all shat- 
tered sets of cardinality p > 2p + 1 is measure zero in 
general. Thus, perhaps a more relevant notion of the 
VC dimension of a neural network is given by p,. 
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6. Genetic Programming and Algorithms 

Genetic Programming (GP) and Genetic Algo- 
rithms (GA) are based on a similar evolutionary 
metaphor in which "individuals" (potential solutions 
to the problem at hand) compete with respect to a 
user-defined performance measure. For new particle 
searches, the rate of Type II error, the significance, 
the exclusion potential, or G. Punzi's suggestion 0] 
are all reasonable performance measures. Ideally, one 
would use as a performance measure the same pro- 
cedure that will be used to quote the results of the 
experiment. For instance, there is no reason (other 
than speed) that one could not include discriminat- 
ing variables and systematic error in the optimization 
procedure (in fact, the author has done both). 

The use of GP for the classification is fairly lim- 
ited; however, it can be traced to the early works on 
the subject by Koza To the best of the author's 
knowledge, the first application of GP within particle 
physics will appear in [12j . The difference between the 
algorithms is that GAs evolve a bit string which typ- 
ically encodes parameters to a pre-existing program, 
function, or class of cuts, while GP directly evolves 
the programs or functions. For example, Field and 
Kanev [13j used Genetic Algorithms to optimize the 
lower- and upper-bounds for six 1-dimensional cuts 
on Modified Fox- Wolfram "shape" variables. In that 
case, the phase-space region was a pre-defined 6-cube 
and the GA was simply evolving the parameters for 
the upper- and lower-bounds. On the other hand, GP 
algorithm is not constrained to a pre-defined shape or 
parametric form. Instead, the GP approach is con- 
cerned directly with the construction of an optimal, 
non-trivial phase space region {i.e. an acceptance re- 
gion W) with respect to a user-defined performance 
measure. GPs which only produce polynomial expres- 
sions form a vector space, which allows for a quick 
approximation of their VC dimension 9] . 

7. Conclusions 

Clearly multivariate algorithms will have an in- 
creasingly important role in high energy physics, 
which necessitates that the field develop a coherent 
formalism and carefully consider what it means for 
a method to be optimal. Statistical learning theory 
offers a formalism that is general enough to describe 
all of the common multivariate analysis techniques, 
and provides interesting results relating risk, the num- 
ber of training samples, and the learning capacity of 
the algorithm. However, independent testing sam- 
ples and the global constraint on the rate of Type I 
error places some strain on the risk formalism. Fi- 
nally, when one takes into account limited training 



data and systematic errors it is not clear that indirect 
methods are truly optimizing an experiments sensitiv- 
ity. Direct methods, such as Genetic Programming, 
force analysts to be more clear about what statistical 
statements they plan to make and remove an artificial 
boundary between the goals of the experiment and the 
optimization procedures of the algorithm. 
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